frontier model


Amazon Has New Frontier AI Models--and a Way for Customers to Build Their Own

WIRED

Nova Forge lets Amazon's customers train frontier models for different tasks--a potential breakthrough in making AI actually useful for businesses. Amazon has announced a new family of frontier artificial intelligence models--and a new way for customers to build frontier models of their own. The e-commerce giant unveiled the second generation of its Nova AI models at re:Invent, a company conference held in Las Vegas. The models are nowhere near as popular as those offered by rivals like OpenAI and Google, but Amazon's plan to make them highly customizable could help them gain traction with its cloud users. Amazon detailed two improved large language models, Nova Lite and Nova Pro; a new real-time voice model called Nova Sonic; and a more experimental model called Nova Omni that performs a simulated kind of reasoning over images, audio, and video as well as text.


A Rosetta Stone for AI Benchmarks

Ho, Anson, Denain, Jean-Stanislas, Atanasov, David, Albanie, Samuel, Shah, Rohin

arXiv.org Artificial Intelligence

Most AI benchmarks saturate within years or even months of being introduced, making it hard to study long-run trends in AI capabilities. To address this challenge, we build a statistical framework that stitches benchmarks together, putting model capabilities and benchmark difficulties on a single numerical scale. This acts as a "Rosetta Stone", allowing us to compare models across a wide range of abilities and time periods, even if they were never evaluated on the same benchmarks. Moreover, this works without assuming how capabilities evolve over time or with training compute. We demonstrate three applications of this framework. First, we use it to measure the speed of AI progress over time and to forecast future AI capabilities. Second, we estimate the rate of improvement in algorithmic efficiency, finding estimates that are higher than, but broadly consistent with, prior work. Finally, we show that our approach can be used to detect rapid accelerations in AI progress.
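Frameworks that place model capabilities and item difficulties on one scale are typically variants of item response theory. The sketch below is an assumption about the general approach, not the paper's actual model: it fits a Rasch-style formulation where the probability that model m solves item i is sigmoid(theta_m - b_i), so models never evaluated on the same items still land on a common scale.

```python
# Minimal Rasch-style "stitching" sketch (illustrative, not the paper's model).
import numpy as np

rng = np.random.default_rng(0)
n_models, n_items = 8, 50
theta_true = rng.normal(0, 1, n_models)        # latent model capabilities
b_true = rng.normal(0, 1, n_items)             # latent item difficulties
p = 1 / (1 + np.exp(-(theta_true[:, None] - b_true[None, :])))
scores = rng.binomial(1, p)                    # observed pass/fail matrix

# Joint maximum-likelihood fit by gradient ascent on the Bernoulli log-likelihood.
theta = np.zeros(n_models)
b = np.zeros(n_items)
lr = 0.05
for _ in range(2000):
    pred = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    resid = scores - pred                      # gradient of the log-likelihood
    theta += lr * resid.sum(axis=1) / n_items
    b -= lr * resid.sum(axis=0) / n_models
theta -= theta.mean()                          # fix the scale's free offset

print(np.corrcoef(theta, theta_true)[0, 1])    # recovered capability ranking
```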


AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models

Jackson, Declan, Keating, William, Cameron, George, Hill-Smith, Micah

arXiv.org Artificial Intelligence

We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources and cover 42 economically relevant topics within six domains. The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) of factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 corresponding to a model that answers as many questions correctly as incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with models from three different research labs leading across the six domains. This variability suggests that, for tasks where knowledge matters, models should be chosen according to the demands of the use case rather than by general performance.
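The abstract defines the Omniscience Index only qualitatively. The formula below is an assumption, not taken from the paper; it is one simple score with the stated properties: bounded in [-100, 100], zero when correct and incorrect answers balance, hallucinations penalized, and abstention scored neutrally rather than as an error.

```python
# Hedged sketch of a score with the Omniscience Index's stated properties.
# The exact formula is an assumption, not the paper's definition.
def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions scored")
    return 100.0 * (correct - incorrect) / total

# Hallucinating on hard questions scores worse than abstaining on them:
print(omniscience_index(correct=500, incorrect=450, abstained=50))   # 5.0
print(omniscience_index(correct=500, incorrect=100, abstained=400))  # 40.0
```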


GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation

Chatzikyriakidis, Stergios, Papadakis, Dimitris, Papaioannou, Sevasti-Ioanna, Psaltaki, Erofili

arXiv.org Artificial Intelligence

We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic, and Northern Greek, while adding six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevousa Greek. The result is a dataset with a total size of 6,374,939 words across 10 varieties. This is the first dataset of such variation and size to date. We conduct a number of fine-tuning experiments to assess the effect of good-quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).
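The abstract does not give the fine-tuning recipe. A minimal LoRA sketch of the kind of experiment described might look like the following; the dataset path, hyperparameters, and tokenization settings are placeholders, not the paper's configuration.

```python
# Hedged LoRA fine-tuning sketch; paths and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Meta-Llama-3-8B"  # one of the three architectures used
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Placeholder file of dialectal text, one example per line.
data = load_dataset("text", data_files={"train": "grdd_plus_dialect.txt"})
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                batched=True)

Trainer(
    model=model,
    args=TrainingArguments("out", per_device_train_batch_size=2, num_train_epochs=1),
    train_dataset=data["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```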


Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models

Balasubramanian, Sriram, Basu, Samyadeep, Goswami, Koustava, Rossi, Ryan, Manjunatha, Varun, Santhosh, Roshan, Zhang, Ruiyi, Feizi, Soheil, Lipka, Nedim

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
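As a rough illustration of the prompting idea that motivates DecompTune (asking the model to decompose its answer into units and tie each unit to a passage), here is a hypothetical prompt template; the paper's actual prompt is not reproduced here, and the format it requests is an assumption.

```python
# Hypothetical decomposition-plus-attribution prompt; illustrative only.
def attribution_prompt(context_passages: list[str], question: str, answer: str) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(context_passages))
    return (
        "Context passages:\n"
        f"{numbered}\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Decompose the answer into minimal factual units. For each unit, "
        "output the unit text followed by the passage numbers that support it, "
        "e.g. 'unit text -> [2], [5]'. Mark unsupported units with '-> []'."
    )
```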


Humains-Junior: A 3.8B Language Model Achieving GPT-4o-Level Factual Accuracy by Directed Exoskeleton Reasoning

Yaron, Nissan, Bystritsky, Dan, Yaron, Ben-Etzion

arXiv.org Artificial Intelligence

We introduce Humains-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin.

Results. On Q1--Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5--77.2) and Humains-Junior 72.7% (95% CI 68.7--76.5); the paired difference is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen's $d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (but not at $\pm 3$ pp). When purchased as managed APIs, Humains-Junior's base model (Phi-3.5-mini-instruct) is $\approx 19\times$ less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured versus estimated pricing sources are tabulated in Appendix E.

Method. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, the two synergize (+17.7 pp, $p < 0.001$) and reduce variance by $\approx 25\%$. In prompt-only settings on frontier models (Q1--Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, $n = 100$); see Section 5.

TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within $\pm 5$ pp on Q1--Q500). Cloud pricing shows $\approx 19\times$ lower cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1--Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F.

Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI
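To make the equivalence analysis concrete, here is a hedged sketch of a paired bootstrap confidence interval combined with the common confidence-interval form of TOST (equivalence at level alpha if the 90% CI lies inside the margin). The per-question grades below are synthetic placeholders, not the paper's data.

```python
# Paired bootstrap CI + CI-inclusion TOST sketch; data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 500
gpt4o = rng.binomial(1, 0.735, n)      # placeholder per-question grades
junior = rng.binomial(1, 0.727, n)

diffs = gpt4o - junior
boot = np.array([rng.choice(diffs, n, replace=True).mean() for _ in range(10_000)])
lo, hi = np.percentile(boot, [5, 95])  # 90% CI for CI-inclusion TOST
margin = 0.05                          # the paper's +/- 5 pp margin
print(f"paired diff = {diffs.mean():+.3f}, 90% CI = ({lo:+.3f}, {hi:+.3f})")
print("equivalent at +/- 5 pp:", -margin < lo and hi < margin)
```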


QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture

Prakash, Shvetank, Cheng, Andrew, Tschand, Arya, Mazumder, Mark, Gohil, Varun, Ma, Jeffrey, Yik, Jason, Wan, Zishen, Quaye, Jessica, Alvanaki, Elisavet Lydia, Kumar, Avinash, Mazumdar, Chandrashis, Khare, Tuhin, Ingare, Alexander, Uchendu, Ikechukwu, Ghosal, Radhika, Tyagi, Abhishek, Wang, Chenyu, Garavagno, Andrea Mattia, Gu, Sarah, Guo, Alice, Hur, Grace, Carloni, Luca, Krishna, Tushar, Nayak, Ankita, Yazdanbakhsh, Amir, Reddi, Vijay Janapa

arXiv.org Artificial Intelligence

The field of computer architecture, which bridges high-level software abstractions and low-level hardware implementations, remains absent from current large language model (LLM) evaluations. To address this gap, we present QuArch (pronounced 'quark'), the first benchmark designed to facilitate the development and evaluation of LLM knowledge and reasoning capabilities specifically in computer architecture. QuArch provides a comprehensive collection of 2,671 expert-validated question-answer (QA) pairs covering various aspects of computer architecture, including processor design, memory systems, and interconnection networks. Our evaluation reveals that while frontier models possess domain-specific knowledge, they struggle with skills that require higher-order thinking in computer architecture. Frontier model accuracies vary widely (from 34% to 72%) on these advanced questions, highlighting persistent gaps in architectural reasoning across analysis, design, and implementation QAs. By holistically assessing fundamental skills, QuArch provides a foundation for building and measuring LLM capabilities that can accelerate innovation in computing systems. With over 140 contributors from 40 institutions, this benchmark represents a community effort to set the standard for architectural reasoning in LLM evaluation.
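A small sketch of the per-skill accuracy breakdown the abstract describes (analysis, design, implementation); the field names and records below are illustrative, not QuArch's actual schema.

```python
# Illustrative per-skill accuracy report; records are placeholders.
from collections import defaultdict

results = [  # (question category, model answered correctly?)
    ("analysis", True), ("analysis", False),
    ("design", True), ("implementation", False),
]

by_cat = defaultdict(list)
for cat, ok in results:
    by_cat[cat].append(ok)
for cat, oks in sorted(by_cat.items()):
    print(f"{cat}: {100 * sum(oks) / len(oks):.0f}% ({len(oks)} QAs)")
```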


Toward Cybersecurity-Expert Small Language Models

Levi, Matan, Ohayon, Daniel, Blobstein, Ariel, Sagi, Ravid, Molloy, Ian, Allouche, Yair

arXiv.org Artificial Intelligence

Large language models (LLMs) are transforming everyday applications, yet their deployment in cybersecurity lags due to a lack of high-quality, domain-specific models and training datasets. To address this gap, we present CyberPal 2.0, a family of cybersecurity-expert small language models (SLMs) ranging from 4B to 20B parameters. To train CyberPal 2.0, we generate an enriched chain-of-thought cybersecurity instruction dataset built with our data enrichment and formatting pipeline, SecKnowledge 2.0, which integrates expert-in-the-loop steering of reasoning formats with LLM-driven multi-step grounding, yielding higher-fidelity, task-grounded reasoning traces for security tasks. Across diverse cybersecurity benchmarks, CyberPal 2.0 consistently outperforms its baselines and matches or surpasses various open- and closed-source frontier models while remaining a fraction of their size. On core cyber threat intelligence knowledge tasks, our models outperform almost all tested frontier models, ranking second only to Sec-Gemini v1. On core threat-investigation tasks, such as correlating vulnerabilities and bug tickets with weaknesses, our best 20B-parameter model ranks first, outperforming GPT-4o, o1, o3-mini, and Sec-Gemini v1, while our smallest 4B-parameter model ranks second.


Position: Require Frontier AI Labs To Release Small "Analog" Models

Upadhyay, Shriyash, Bandi, Chaithanya, Oozeer, Narmeen, Quirke, Philip

arXiv.org Artificial Intelligence

Recent proposals for regulating frontier AI models have sparked concerns about the cost of safety regulation, and most such proposals have been shelved due to the perceived safety-innovation tradeoff. This paper argues for an alternative regulatory approach that ensures AI safety while actively promoting innovation: mandating that large AI laboratories release small, openly accessible "analog" models (scaled-down versions) trained similarly to, and distilled from, their largest proprietary models. Analog models serve as public proxies, allowing broad participation in safety verification, interpretability research, and algorithmic transparency without forcing labs to disclose their full-scale models. Recent research demonstrates that safety and interpretability methods developed using these smaller models generalize effectively to frontier-scale systems. By enabling the wider research community to directly investigate and innovate upon accessible analogs, our policy substantially reduces the regulatory burden and accelerates safety advancements. The mandate promises minimal additional costs, leveraging reusable resources like data and infrastructure, while contributing significantly to the public good. Our hope is not only that this policy will be adopted, but that it illustrates a broader principle supporting fundamental research in machine learning: a deeper understanding of models relaxes the safety-innovation tradeoff and lets us have more of both.


The AI Industry's Scaling Obsession Is Headed for a Cliff

WIRED

Huge AI infrastructure deals assume that algorithms will keep improving with scale. A new study from MIT suggests the biggest and most computationally intensive AI models may soon offer diminishing returns compared to smaller models. By mapping scaling laws against continued improvements in model efficiency, the researchers found that it could become harder to wring leaps in performance from giant models, whereas efficiency gains could make models running on more modest hardware increasingly capable over the next decade. "In the next five to 10 years, things are very likely to start narrowing," says Neil Thompson, a computer scientist and professor at MIT involved in the study. Leaps in efficiency, like those seen with DeepSeek's remarkably low-cost model in January, have already served as a reality check for an AI industry accustomed to burning massive amounts of compute.
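For intuition about why returns diminish with scale, here is a hedged illustration using a Chinchilla-style scaling law of the form L = E + A/N^a + B/D^b; the constants are the published Chinchilla fit, used purely for illustration and not taken from the MIT study WIRED describes.

```python
# Diminishing returns under a Chinchilla-style scaling law (illustrative).
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28  # published Chinchilla fit

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**a + B / n_tokens**b

# Each 10x in parameters (with ~20 tokens per parameter) buys a smaller
# absolute drop in loss, while the irreducible term E stays fixed.
for n in [1e9, 1e10, 1e11, 1e12]:
    print(f"N={n:.0e}: loss={loss(n, 20 * n):.3f}")
```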